An Optimal Feature Set for Stylometry-based Style Change Detection at Document and Sentence Level

Authors

Abstract

Writing style change detection models focus on determining the number of authors of documents with or without known authors. Determining the exact number of authors contributing to a document, particularly when the authors contribute short texts in the form of a sentence, is still challenging because of the lack of standardized feature sets able to discriminate between the works of different authors. Therefore, the task of identifying the best feature set for all style change detection tasks is considered important. This paper sought to determine the best feature set for the following tasks: separating documents with several style changes (multi-authorship) from documents without any style changes (single-authorship), and determining the location of the style change in the case of multi-authorship. We performed exploratory research on existing stylometric features at document level and sentence level. Document-level features were extracted and used to separate single-authored from multi-authored documents, while sentence-level features were used to answer the question of where the style changes occur. To answer this question, we trained a random forest classifier to rank the document-level and sentence-level features separately, and applied an ablation test on the top 15 features using the k-means clustering algorithm to confirm the effect of these features on model performance. The study found that the best document-level feature set was provided by an ensemble of features including the number of sentence repetitions (num_sentence_repetitions) as the most determinant feature, 5-grams, 4-grams, Special_character, sentence_begin_lower, sentence_begin_upper, diversity, automated_readability_index, parenthesis_count, first_word_uppercase, lensear_write_formula, dale_chall_readability, difficult_words, and type_token_ratio. These features were ranked in experiment one. On the other hand, the top fifteen sentence-level features based on their ranks included dale_chall_readability grade, check_available_vowel, flesch_kincaid grade, colon_count, verbs, bigrams, alphabets, personal pronouns, coordinating conjunctions, interjections, modals, type_token ratio, and punctuations_count. Consequently, the optimal results were obtained with a feature set including check_available_vowels, punctuations_counts, parenthesis count, conjunctions, and colon count.
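Several of the feature names in the abstract are simple surface counts. The sketch below shows one way such features might be computed with the Python standard library. The abstract does not define the features, so every definition here (the sentence splitter, what counts as a repetition, which characters count as punctuation) is an assumption made for illustration, not the authors' implementation.

```python
import re
import string
from collections import Counter


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sentence_features(sentence):
    """Sentence-level counts; definitions are illustrative assumptions."""
    words = re.findall(r"[A-Za-z']+", sentence.lower())
    return {
        "colon_count": sentence.count(":"),
        "parenthesis_count": sentence.count("(") + sentence.count(")"),
        "punctuations_count": sum(ch in string.punctuation for ch in sentence),
        # Distinct words divided by total words; 0.0 for an empty sentence.
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
    }


def document_features(text):
    """Document-level counts; definitions are illustrative assumptions."""
    sentences = split_sentences(text)
    seen = Counter(s.lower() for s in sentences)
    return {
        # Each extra occurrence of an identical sentence counts as one repetition.
        "num_sentence_repetitions": sum(n - 1 for n in seen.values()),
        "sentence_begin_upper": sum(s[0].isupper() for s in sentences),
        "sentence_begin_lower": sum(s[0].islower() for s in sentences),
    }
```

Feature dictionaries like these could then be stacked into a matrix and passed to a classifier for ranking, as the abstract describes for the random forest step.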

Similar resources

Stylometry-based Fraud and Plagiarism Detection for Learning at Scale

Fraud detection in free and natural text submissions is a major challenge for educators in general. It is even more challenging to detect plagiarism at scale and in online classes such as Massive Open Online Courses. In this paper, we introduce a novel method that analyses the writing style of an author (stylometry) to identify plagiarism. We will show that our system scales to thousands of sub...

Document-to-Sentence Level Technique for Novelty Detection

Novelty identification is used to distinguish novel data from an incoming stream of documents. In this study, we propose a novel methodology for document-level novelty identification using a document-to-sentence-level strategy. This work first splits a document into sentences, decides the novelty of every sentence, then computes the document-level novelty score based on an al...

A Corpus-Independent Feature Set for Style-Based Text Categorization

We suggest a corpus-independent feature set appropriate for style-based text categorization problems. To achieve this, we introduce a new measure on linguistic features, called stability, which captures the extent to which a language element, such as a word or syntactic construct, is replaceable by semantically equivalent elements. This measure may be perceived as quantifying the degree of avai...

Using Contextual Information for Unsupervised Change Detection Using Multitemporal SAR Images Based on Clustering and Level Set Methods

In this research, a framework is presented for unsupervised change detection using multitemporal SAR images based on the integration of clustering and level set methods. Spatial correlation between pixels was considered by using contextual information. The proposed method also integrates Gustafson-Kessel clustering (GKC) techniques and level set methods for change detection. Using clust...

Feature Set Reduction for Document Classification Problems

With a growing amount of electronic documents available, there is a need to classify documents automatically. In growing text classification applications, important-term selection is a critical task for the classifier performance. Although many different techniques and heuristics have been developed, this paper shows that many of them are just a sub-set of more advanced methods originating in t...


Journal

Journal title: International Journal of Scientific Research in Computer Science, Engineering and Information Technology

Year: 2022

ISSN: 2456-3307

DOI: https://doi.org/10.32628/cseit228617